In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
from scipy import stats
from myutil import *
Jeroen Le Maire --- A network tour of data science
This project aims to check whether it is possible to discover outperforming stocks with machine learning. The data comes from Professor Milosevic of the University of Manchester, whom I contacted for this project. It covers 1739 stocks with quarterly data over the period 2012 to 2015, and contains about 20 features for every data point. The data was formatted as a kind of flash cards, so it took a while to reformat it correctly.
In [2]:
df = load_data()
#Shows which percentage of the data is loaded (long load time)
A quick look at the data to check that everything is OK.
In [3]:
df.iloc[0:10,:]
Out[3]:
First, the data points for which it is impossible to calculate the future return are deleted. These fall into two groups: the most recent data points, for which the future return is not available yet, and the data points for which the date is unknown. We add a column that contains the return over the next year, plus a feature called 'breturn': a binary column that contains a one if the stock will perform really well (a return of more than 15%). Incomplete data is encoded as -9999; we change this to NaN for easier filtering.
In [4]:
df=df[df['PX1YR'] != 'empty']
df=df.dropna(subset=['Date'])
df['return']=(df['PX1YR'].astype(float)/df['PXnow'].astype(float)-1)
df['breturn'] = (df['return']>0.15)*1
df.iloc[0:10,:]
df=df.replace(to_replace='-9999',value=float('NaN'))
#df['DIVIDENDY'].fillna(0, inplace=True)
The feature columns are converted to floats to ensure correct calculations. The first two columns are not converted because they contain the name and the date. The intermediate data frame is saved as well. With the describe method we get a quick overview of the data. The percentiles are not yet correct, as there are still some incomplete data points; in the next part only the complete data will be used.
In [5]:
df,attributes = columnstofloat(df)
import os.path
df.to_csv(os.path.join('df.csv'))
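The helper columnstofloat lives in myutil and is not reproduced in this notebook. As a minimal sketch, assuming the first two columns hold the name and the date and everything else should become numeric, it might look roughly like this (the function body below is an assumption, not the actual implementation):
def columnstofloat_sketch(df):
    # Hypothetical stand-in for myutil.columnstofloat.
    # Skip the first two columns (name and date), cast the rest to float.
    attributes = list(df.columns[2:])
    for col in attributes:
        df[col] = pd.to_numeric(df[col], errors='coerce')  # unparsable values become NaN
    return df, attributes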
In [6]:
df.loc[:,attributes].describe().astype(float)
Out[6]:
Let's calculate the percentage of data points that have a positive return after 1 year.
In [7]:
#Percentage of companies whose stock price increases
df[df['return'] > 0].shape[0]/df.shape[0]*100
#In the last column we see that the average return is 16.86 %
Out[7]:
As the data contains some outliers, we want to remove them. In some cases the values are impossibly high; in other cases they occur so rarely that no conclusions can be drawn from them. The z-score is used to delete these data points, which also makes the plots clearer.
In [8]:
df = remove_outliers(df)
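remove_outliers is defined in myutil; a minimal sketch of a z-score filter along the lines described above might look like this (the threshold of 3 and the choice to keep NaN entries for now are assumptions):
def remove_outliers_sketch(df, columns, z_max=3.0):
    # Hypothetical stand-in for myutil.remove_outliers: drop rows whose
    # z-score exceeds z_max in any of the given numeric columns.
    values = df[columns].astype(float)
    z = ((values - values.mean()) / values.std()).abs()
    keep = (z < z_max) | values.isna()  # keep NaN entries, they are handled later
    return df[keep.all(axis=1)]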
The PE (price/earnings) ratio is one of the most important ratios in fundamental analysis, so let's try to find a relation between the PE ratio and the return of a share. To do so, we split the data into classes based on the PE value and calculate the average return for each class. In theory, the lower the PE ratio, the higher the return should be, and the plot confirms this. Note: the low PE values fluctuate a lot because there are fewer data points there.
In [31]:
bins = [5,6,7,8,9,10,12,14,16,18,20,25,30,35,40,1000]
labels = [x+0.5 for x in bins[0:15]]
df['bin']=pd.cut(df['PE'],bins,labels=labels)
In [33]:
group = df.groupby('bin').mean()
counter = df.groupby('bin').count()
print(counter.iloc[:5,:1])
corr = df['return'].corr(df['bin'], method='pearson')
group['return'].plot(grid=True, title='Pearson correlation: {:.4f}'.format(corr), figsize=(15,5));
df['BEST_EPS']=df['BEST_EPS']/df['PXnow']
Now that we've seen the correlation between the PE and the return, let's check the correlations between the other features and the return. The next plots show a dot for every data point: the y-axis shows the return, while the x-axis shows the feature. In each plot a straight regression line is added, which visualizes the correlation between the feature and the return.
In [11]:
import seaborn as sns
x = attributes
y = ['return']
sns.pairplot(df, hue=None, hue_order=None, palette=None, x_vars=x, y_vars=y, kind='reg', diag_kind='hist', markers=None, size=5, aspect=1, dropna=True)
Out[11]:
Here, two of the plots are described and interpreted.
Current market capitalization: some of the small caps have higher returns, but some have big losses. The spread of returns is much wider for small companies, which confirms that small caps carry higher risk. The regression line shows the negative correlation between the return and the market capitalization.
BEST EPS: the scatter plot is less clear, but the regression line shows that there is a relation between the BEST EPS value and the return. Note that I divided the EPS by the share price to turn it into a ratio; it effectively becomes a 'BEST EP', the inverse of the BEST PE.
The next plot is interactive and shows the market capitalization, the BEST EP ratio and the return for every data point. The colour of a dot encodes the dividend: the higher the dividend, the greener the dot. These are essentially the same plots as above, but interactive.
In [12]:
output_notebook()
bokehplot(df)
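The plotting helper bokehplot is part of myutil. As a rough sketch of what such an interactive Bokeh scatter could look like (the column names 'BEST_EPS', 'return' and 'DIVIDENDY' and all styling choices are assumptions):
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.transform import linear_cmap
from bokeh.palettes import Greens9

def bokehplot_sketch(df):
    # Hypothetical stand-in for myutil.bokehplot: interactive scatter of the
    # BEST EP ratio against the 1-year return, coloured by dividend yield.
    source = ColumnDataSource(df)
    # Reverse the palette so that higher dividends map to the greener end
    # (palette ordering is an assumption and may need checking).
    cmap = linear_cmap('DIVIDENDY', list(reversed(Greens9)),
                       low=df['DIVIDENDY'].min(), high=df['DIVIDENDY'].max())
    p = figure(title='BEST EP vs. 1-year return (colour = dividend)',
               x_axis_label='BEST_EPS / PXnow', y_axis_label='return',
               plot_width=800, plot_height=400)
    p.circle('BEST_EPS', 'return', source=source, color=cmap, size=5)
    show(p)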
In the analysis so far, NaN values did no harm because they were simply ignored. For the further calculations it is important to drop them, as they would influence the results. Now the data can be summarized and we can do a first check.
In [14]:
df = df.iloc[:,2:].reset_index(drop=True)
df = df.dropna()  # Up to here the few NaNs did no harm, but for the neural network we don't want them
df.loc[:,attributes].describe().astype(float)
Out[14]:
Some conclusions can already be drawn from the correlation plots, but it is better to have a numerical value. To calculate the correlation coefficients, the data is first normalised. We see that the dividend, the market capitalization, the sales growth and the BEST EP have the strongest correlation with the return.
In [15]:
print_correlation_return(df,attributes)
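print_correlation_return is another helper from myutil. A minimal sketch, assuming it standardises each feature and prints its Pearson correlation with the return sorted by absolute value (the exact normalisation and output format are assumptions):
def print_correlation_return_sketch(df, attributes):
    # Hypothetical stand-in for myutil.print_correlation_return.
    normalised = (df[attributes] - df[attributes].mean()) / df[attributes].std()
    correlations = normalised.corrwith(df['return'])
    order = correlations.abs().sort_values(ascending=False).index
    for name in order:
        print('{:<25s} {:+.3f}'.format(name, correlations[name]))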
The relevant columns of the data are split into training and test sets, and the inputs are separated from the labels. A permutation puts the data in random order.
In [16]:
# Training and testing sets.
x_test,x_train,y_test1,y_test2,y_train1,y_train2 = split_and_permute(df,attributes,550,5000)
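split_and_permute is defined in myutil; a minimal sketch, assuming the two numeric arguments are the number of test and training samples and that the labels are the 'return' and 'breturn' columns, could look like this (the signature and details are assumptions):
def split_and_permute_sketch(df, attributes, n_test, n_train, seed=0):
    # Hypothetical stand-in for myutil.split_and_permute: shuffle the rows,
    # then carve out test and training sets for the features and both labels.
    order = np.random.RandomState(seed).permutation(len(df))
    shuffled = df.iloc[order].reset_index(drop=True)
    test = shuffled.iloc[:n_test]
    train = shuffled.iloc[n_test:n_test + n_train]
    return (test[attributes], train[attributes],
            test['return'], test['breturn'],
            train['return'], train['breturn'])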
The data is normalised. Note that the normalisation along each sample is commented out: after some classification runs, I noticed that the performance increases if I only normalise along the features.
In [20]:
#Normalise along each feature, using statistics computed on the training set
x_test_orig = x_test.copy() # This variable is later used for checking the performance of the classifier
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
#Normalise along each sample (commented out, it lowered the classification performance)
#x_test = x_test.div(np.square(x_test).sum(axis=1),axis=0)
#x_train = x_train.div(np.square(x_train).sum(axis=1),axis=0)
To make sure everything is right, the data is reindexed. The last part of this notebook contains a neural network; that part needs to be run in the Docker container with TensorFlow, so to keep this notebook runnable the data is stored in files. There are two sets of labels: label1 (y_test1 and y_train1) contains the actual value of the return, while label2 is binary, where a one means the stock will gain more than 15%.
In [21]:
x_test_orig = x_test_orig.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
x_train = x_train.reset_index(drop=True)
y_test1 = y_test1.reset_index(drop=True)
y_test2 = y_test2.reset_index(drop=True)
y_train1 = y_train1.reset_index(drop=True)
y_train2 = y_train2.reset_index(drop=True)
np.save('x_test',x_test)
np.save('x_train',x_train)
np.save('y_test1',y_test1)
np.save('y_test2',y_test2)
np.save('y_train1',y_train1)
np.save('y_train2',y_train2)
#x_test.to_csv(os.path.join('x_test.csv'))
#x_train.to_csv(os.path.join('x_train.csv'))
#y_test1.to_csv(os.path.join('y_test1.csv'))
#y_test2.to_csv(os.path.join('y_test2.csv'))
#y_train1.to_csv(os.path.join('y_train1.csv'))
#y_train2.to_csv(os.path.join('y_train2.csv'))
In [22]:
test_pred = classifiers(x_test,x_train,y_test2,y_train2)
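The classifiers helper lives in myutil and is not shown here. As a rough sketch, assuming it fits a couple of standard scikit-learn models on the binary label and returns the test predictions of one of them (the model choice and settings are assumptions):
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

def classifiers_sketch(x_test, x_train, y_test, y_train):
    # Hypothetical stand-in for myutil.classifiers: train a few standard
    # models on the binary label and return one set of test predictions.
    models = {'linear SVM': LinearSVC(),
              'random forest': RandomForestClassifier(n_estimators=100)}
    predictions = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        predictions[name] = model.predict(x_test)
        print('{}: test accuracy {:.3f}'.format(name, model.score(x_test, y_test)))
    return predictions['linear SVM']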
The classifiers don't reach high accuracies, which means that it is hard to predict which stocks will perform better.
In [23]:
res = np.matrix([test_pred,y_test2])
In [24]:
false_positive = 0
false_negative = 0
for i in range(0, res.shape[1]):
    if (res[0,i] != res[1,i]) and (res[0,i] != 0):
        false_positive += 1
    elif (res[0,i] != res[1,i]) and (res[0,i] == 0):
        false_negative += 1
print('There were', false_positive, 'false positives and', false_negative, 'false negatives')
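The same counts can also be read off a confusion matrix, for example with scikit-learn (assuming it is available in the environment):
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predictions:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_test2, test_pred).ravel()
print('There were', fp, 'false positives and', fn, 'false negatives')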
In [25]:
res.sum(axis=1)
Out[25]:
The next bokeh plot shows three features; the stocks that will have higher returns according to the classifier are marked green. As you can see, most of the data points with high returns are detected. The plot with the dividend shows that it is the main feature for the classification: almost all data points below a certain threshold are marked green.
In [26]:
output_notebook()
bokehplot2(x_test_orig,y_test1,test_pred)
As mentioned before, the performance of the classifier isn't very good. But let's imagine that we had bought the stocks selected by the classifier. The first box shows the average return of all the test data; the second box shows the average return of the stocks selected by the algorithm. The selection outperforms the 'market' by about 3 percent, which seems quite OK.
In [27]:
y_test1.mean()
Out[27]:
In [28]:
y_test1[test_pred==1].mean()
Out[28]:
To use the techniques seen in the course, I implemented a three-layer neural network. The three layers are fully connected and contain 200, 50 and 2 nodes. The performance is similar to that of the classifiers, although the network doesn't seem to train properly. I tried many different learning rates, different network sizes, adding dropout, adding regularization with different parameters, different loss functions, different numbers of training runs, ... but the network never converged nicely. All the code is in myutilnn.py.
In [1]:
from myutilnn import *
trainnn()
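The network itself lives in myutilnn.py and is not reproduced here. As a sketch of the described architecture (three dense layers with 200, 50 and 2 nodes), written here with the Keras API rather than the original code, it might look like this; the activation functions, optimizer, loss and training schedule are assumptions:
import numpy as np
import tensorflow as tf

def trainnn_sketch(epochs=50, batch_size=64):
    # Hypothetical stand-in for myutilnn.trainnn, trained on the saved arrays.
    x_train = np.load('x_train.npy')
    y_train = np.load('y_train2.npy').astype(int)  # binary label: return > 15%
    x_test = np.load('x_test.npy')
    y_test = np.load('y_test2.npy').astype(int)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(200, activation='relu', input_shape=(x_train.shape[1],)),
        tf.keras.layers.Dense(50, activation='relu'),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
              validation_data=(x_test, y_test), verbose=2)
    return model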
The basic data analysis gave some good insights and shows that there is no 'golden rule' for finding outperforming stocks. The classifiers didn't perform very well, although it was possible to reach higher returns by using them. Further optimization is possible: adapting the classifiers, doing some feature extraction, ... Concerning the neural network, it is hard to tell whether there are errors in the code or the data, or whether the setup of the network is simply wrong. Finally, algorithms can help people with investment decisions, although some human intervention remains useful.
In [ ]: